Internet Info 1997 December

Internet Message Format | 1997-02-19 | 12KB

Received: (from daemon@localhost) by services.bunyip.com (8.6.10/8.6.9)
    id IAA23126 for urn-ietf-out; Thu, 14 Nov 1996 08:55:31 -0500
Received: from mocha.bunyip.com (mocha.Bunyip.Com [192.197.208.1])
    by services.bunyip.com (8.6.10/8.6.9) with SMTP id IAA23116
    for <urn-ietf@services.bunyip.com>; Thu, 14 Nov 1996 08:54:45 -0500
Received: from josef.ifi.unizh.ch by mocha.bunyip.com with SMTP
    (5.65a/IDA-1.4.2b/CC-Guru-2b) id AA19360
    (mail destined for urn-ietf@services.bunyip.com);
    Thu, 14 Nov 96 08:54:05 -0500
Received: from ifi.unizh.ch by josef.ifi.unizh.ch
    id <00564-0@josef.ifi.unizh.ch>; Thu, 14 Nov 1996 14:51:53 +0100
Subject: Re: [URN] I18N does not belong in URNs
To: Dirk.vanGulik@jrc.it
Date: Thu, 14 Nov 1996 14:51:52 +0100 (MET)
Cc: FisherM@is3.indy.tce.com, moore@cs.utk.edu, girod@LCS.MIT.EDU,
    tallen@fsc.fujitsu.com, urn-ietf@bunyip.com
In-Reply-To: <9611141038.AA01418@ jrc.it> from "Dirk vanGulik" at Nov 14, 96 11:38:37 am
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Length: 10713
From: Martin J Duerst <mduerst@ifi.unizh.ch>
Message-Id: <"josef.ifi..786:14.10.96.13.51.59"@ifi.unizh.ch>
Sender: owner-urn-ietf@services.bunyip.com
Precedence: bulk
Reply-To: Martin J Duerst <mduerst@ifi.unizh.ch>
Errors-To: owner-urn-ietf@bunyip.com

Dirk.vanGulik wrote:

>> Dirk.vanGulik wrote:
>
>> >Or they use a naming scheme dependent interpretation. You could for
>> >example simply limit the representation of the URN to the glyphs
>> >A-Z, 0-9 and say the dot, dash, and colon.
>>
>> Too much in favor of English users (+Hawaiian and Suwaheli)!
>
>Well as a dutch person working in italy in a swiss building; I
>can quite live with it :-) No just kidding, but seriously why
>does this 'favour' ? And where does it come in ?

This has been mentioned before: name spaces in the English-speaking
world usually contain letters and numbers, and their mapping to the
URN syntax you propose above would be straightforward.
Neither the designer of an URN namespace nor the actual users would
have to think much about it. Name spaces in countries that use other
scripts quite often contain letters in these scripts. Mapping to the
URN syntax you propose above would not directly be possible. Namespace
designers would invent their own schemes, and namespace users would
have to learn all kinds of different conventions for different
namespaces. I hope that you see that this favors English-speaking
users (and to a lesser extent other users of the Latin script).

>> > I recently came across a specific scheme, say 'crdis' which has a lot of
>> > LocalControlIdentfiers which can (only be fully) expressed in 2-byte octets.
>> > This obviously gave problems as some of our z39.50 servers and HTTP/URN
>> > resolving code did not quite like this. A pragmatic solution was to simply
>> > base64 encode the identfication string.
>
>> > A specialist GUI could now base64-decode the string to arrive
>> > at something meaningfull for a human. But it does not have to.
>> > And it does keep things simple.
>
>> I have mentionned earlier the possibility that some namespaces may
>> contain arbitrary data, as opposed to characters, and I have therefore
>
>Hold on, I was trying to convey that these strings did _INDEED_ contain
>meaningfull text;

You only said they contained 2-byte octets. This could be the resource
number of a system icon, or whatever.

>auctually the various 'strings' where short
>titles in the 19 european languages; carefully protected.

Probably you have to explain more details for us to get the full
picture of this example. Anyway, if they originally were short titles,
i.e. character strings, why invent a complicated name space specific
mapping to URNs? Why not just define a default mapping (i.e. UTF-8 +
%HH) for all possible characters (i.e. the full ISO 10646 UCS-4)?
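A minimal sketch of what such a default mapping (UTF-8 + %HH) could look like in C is given here for illustration. The function names `urn_encode`, `utf8_encode` and `is_literal`, and the set of characters left unescaped, are assumptions made for this example and are not taken from any URN syntax draft. The sketch only handles code points up to U+10FFFF; the definition of UTF-8 in use at the time allowed longer sequences to cover the full UCS-4 range.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: characters that may appear literally. This set (ASCII
 * letters, digits, '-', '.', ':') is illustrative only. */
static int is_literal(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') || c == '-' || c == '.' || c == ':';
}

/* Encode one code point as UTF-8. Returns the number of bytes written,
 * or 0 for values this sketch does not handle (surrogates, > U+10FFFF). */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Map a sequence of code points to "UTF-8 + %HH". Returns the number of
 * characters written (excluding the terminator), or (size_t)-1 on an
 * unsupported code point or when `out` is too small. */
size_t urn_encode(const uint32_t *cps, size_t n, char *out, size_t outsz)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t o = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned char buf[4];
        size_t len = utf8_encode(cps[i], buf);
        if (len == 0)
            return (size_t)-1;
        for (size_t j = 0; j < len; j++) {
            if (is_literal(buf[j])) {
                if (o + 1 >= outsz)
                    return (size_t)-1;
                out[o++] = (char)buf[j];
            } else {
                if (o + 3 >= outsz)
                    return (size_t)-1;
                out[o++] = '%';
                out[o++] = hex[buf[j] >> 4];
                out[o++] = hex[buf[j] & 0x0F];
            }
        }
    }
    if (o >= outsz)
        return (size_t)-1;
    out[o] = '\0';
    return o;
}
```

Under these assumptions, the code points for "Zürich" (U+005A, U+00FC, U+0072, ...) would come out as `Z%C3%BCrich`: the same rule applies to every namespace, so no namespace-specific convention is needed.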

>The same could be said of my phone number
>
>   +39 332 78 0014
>
>Which tells a local here that it is me, living in Ispra(78), near Varse
>(33*)in italy(39). But for most people those digits make no sense all,
>nor have to. And my local phone technician might even tell me more by
>just looking at it.

We are not really talking about meaning here. What we are talking
about is the mapping of existing namespaces (however meaningful or
meaningless they are to whoever) to URNs.

>> >I really would like to roll up the charset discussion; I agree that for a lot
>> >of scheme's, in particular those to be grandfathered in, one will need very
>> >flexible encodings.
>
>> They need to be very flexible in the sense that they have to be
>> able to accomodate a very large set of characters, and something
>> else than characters in some cases.
>
>Well, that is a requirement I do not find in the functional RQ RFC at
>all. Should that be re-written ? I agree that publishers and maintainers
>might like to 'overload' the content; but that is a different issue.

The RQ RFC clearly states that we have to be able to grandfather
existing namespaces. We can do this mainly in two ways:

- Not defining a mapping from arbitrary characters of the universal
  character set to our URN syntax, and just tell namespace designers:
  Solve the problem for yourself. In theory, the grandfathering
  requirement is satisfied (even if we restrict syntax to only two
  different characters). The problem is that namespace implementors
  have to come up with their own ideas of how to map characters to
  URNs, resulting in chaos and inconvenience for everybody. With this,
  we get rid of the grandfathering requirement by delegating it, but
  we do not really do our job.
- Defining a mapping from the characters of the universal character
  set (UCS) to URN syntax. This will make implementing existing
  namespaces in URNs much easier.
  We really deal with the grandfathering requirement, we don't just
  pass it around like a hot potato (it's not hot and quite easy to
  handle anyway). We seriously do our job as well as we are possibly
  able to do it. We avoid chaos where there is absolutely no need for
  chaos.

>Hmm, I think we have a slight terminology clash; much like the problems
>we have had with the URL rfc. There are several layers, which their
>own dimensions. In URLs (correct me if I am wrong Larry) this is solved
>by saying that the acutal URL is an octet-stream. These 8-bit encoded
>values are all there is. However by 'pure coincidence' they can be
>treated as indexes into a charset such as US-Asicc or latin-1 and
>actually yeild something which humans can interpret quite easily.
>
>But for example the first 5 values (say http:) are not the glyph
>'h', 't', p'p and ':' but the values 68747470 or 4854545. (Upper/
>lowercase),

You got it the wrong way round; please reread RFC 1738 if you don't
believe me. URLs are defined as strings of characters that can exist
on paper or inside a computer (in ASCII or EBCDIC or whatever). This
means that in the URL RFC, the "is" is used for the representING side.
The representED side are indeed octets that appear in a protocol.

The fact that you got terminology wrong while you did not say much
that was actually wrong (if one is restricted to ASCII, which is the
case more often than not) shows that writing what URL/Ns "are" can
lead to a lot of confusion. The fact that you mention several layers
also indicates that it is dangerous to make one of these layers
special by using the verb "to be". Different people tend to identify
different layers as the most important one.

In my terminology, RFC 1738 says:

(1) URLs are represented by characters (on paper, in ASCII, in EBCDIC,
    or whatever).
(2) URLs represent octets of some protocols.

RFC 1738 also defines the mapping from the octets in (2) to the
characters in (1).

What we need for URNs is at least:

(1) URNs are represented by characters (on paper, in ASCII,...) from
    the same set as used in URLs.
(2) URNs (in general) represent characters (from UCS-4).

The mapping from the large character set in (2) to the small character
set in (1) has to be defined. The one and only existing and reasonable
proposal for this is to use UTF-8 combined with some octet encoding
such as %HH.

>So what I was saying is that the URN is an octet stream; and the
>allowed values are from ox30 to 0x39, 0x41--0x5a etc. Which happen
>to represent indexes into latin-1, UTF-8 or ascii.

But not EBCDIC. For those that have difficulties separating the
different layers, and have difficulties imagining something such as
Japanese, there is an easy trick: Think about EBCDIC. If you don't at
least think about how to handle things in EBCDIC, the chances are big
that you will forget some important aspect of the problem.

>Now a clever administrator (and a clever GUI) can use something like
>base64 to get a nice UniCode string in.

We don't want the administrator to have to come up with such a
mapping, and we don't want or need to have many different mappings.
By just a little bit of effort from our side, we don't need
administrators to be overly 'clever', and we need GUIs to only support
exactly one kind of presentation conversion (which will really soon be
available in the average Web browser) instead of dozens, hundreds, or
thousands of them (which no Web browser will ever support).

I could live with the definition that an URN "is" an octet stream if
we add (as we have in the current version of the syntax draft) a
specification for how characters from the universal character set
(UCS) should be encoded into URN octets.

>> > 3. And keep a few chars (say the %) in stock, for the future.
>
>> > 4. And remember, one can always make something like a next generation
>> > URN+:das:asd which can only be transcribed properly using say UTF8.
>
>> There is absolutely no need to wait any longer for UTF-8.
>
>Well, I can agree; but not for the premisse; I do not see the need
>for the charset flexibilit. I know people can get quite religious about
>their names; so you have a point that we should be flexible to accomodate,
>but on the other hand it does make implementation harder whilst the
>functionality does not increase a bit.

It's not about peoples' names, unless you consider that a namespace
suitable for URNs. It's about actually existing namespaces that
contain characters beyond the set you are proposing.

As for implementation, you are free to provide your browser with a %HH
interface only. For the current syntax draft, you would have to add a
routine that checks for the occurrence of 8-bit bytes, and you would
have to convert them to %HH before comparison. That's about five lines
of C code, not really a hard thing to do.

As for implementing a user interface that allows input and rendering
of the full universal character set (UCS), that indeed is quite some
work, but (1) there is no need to provide full implementation of all
characters, (2) as far as it is needed, it will be done anyway in the
average browser because of requirements from other protocols and
formats (such as HTML), and (3) implementing this is still
tremendously simpler than implementing GUIs for all the 'clever'
encoding systems that all those 'clever' administrators have come up
with.

>> All your requirements can be fulfilled by:
>
>> - Allowing non-character namespaces to create their own non UTF-8 encodings
>>   (as I have suggested previously).
>> - Require that internally, all 8-bit octets resulting from UTF-8 encoding
>>   of characters beyond ASCII (+some in ASCII) be encoded with
>>   %HH, so that there is no 8-bit from of URNs.
>
>> Although I have good reasons to think that the second point is not needed,
>> I could live with it.
>> But giving up a convention such as UTF-8 to map
>> arbitrary characters in namespaces to URNs would be a great loss.
>
>Well I think I do not see those requirements; but perhaps we should look
>at the RQ RFC again, to see what possibly is missing.

The above requirements are possible refinements of the general
requirements for grandfathering and transcribability.

Regards, Martin.
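The comparison routine described in the message (check for 8-bit bytes and convert them to %HH before comparing) could be sketched in C roughly as follows. This is an illustration only: the names `urn_escape_8bit` and `urn_equal`, the fixed 512-byte buffers, and the choice of uppercase hex digits are assumptions of this sketch, not part of the syntax draft under discussion.

```c
#include <stddef.h>
#include <string.h>

/* Copy `in` to `out`, replacing every byte that has the high bit set
 * with its %HH form. Returns 0 on success, -1 if `out` is too small. */
int urn_escape_8bit(const char *in, char *out, size_t outsz)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t o = 0;

    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;
        if (c < 0x80) {
            if (o + 1 >= outsz)
                return -1;
            out[o++] = (char)c;
        } else {
            if (o + 3 >= outsz)
                return -1;
            out[o++] = '%';
            out[o++] = hex[c >> 4];
            out[o++] = hex[c & 0x0F];
        }
    }
    if (o >= outsz)
        return -1;
    out[o] = '\0';
    return 0;
}

/* Compare two URN strings after escaping 8-bit bytes.
 * Returns 1 if equal, 0 if different, -1 if an input is too long. */
int urn_equal(const char *a, const char *b)
{
    char na[512], nb[512];

    if (urn_escape_8bit(a, na, sizeof na) != 0 ||
        urn_escape_8bit(b, nb, sizeof nb) != 0)
        return -1;
    return strcmp(na, nb) == 0;
}
```

With this sketch, a URN carrying raw UTF-8 bytes and the same URN written with uppercase %HH escapes compare as equal. Escapes written with lowercase hex digits (`%c3` versus `%C3`) would still compare as different; whether they should is a question this sketch leaves open.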